Machine learning-based predictive model for prevention of metabolic syndrome

Metabolic syndrome (MetS) is a chronic disease caused by obesity, high blood pressure, high blood sugar, and dyslipidemia and may lead to cardiovascular disease or type 2 diabetes. Therefore, the detection and prevention of MetS at an early stage are imperative. Individuals can detect MetS early and manage it effectively if they can easily monitor their health status in their daily lives. In this study, a predictive model for MetS was developed using only noninvasive information, which facilitates its practical application in real-world scenarios. The model deliberately excluded the three features that require blood testing: triglycerides, blood sugar, and HDL cholesterol. We used a large-scale Korean health examination dataset (n = 70,370; prevalence of MetS = 13.6%) to develop the predictive model. To obtain informative features, we developed three novel synthetic features from four basic pieces of information: waist circumference, systolic and diastolic blood pressure, and gender. We tested several classification algorithms and confirmed that the decision tree model is the most appropriate for the practical prediction of MetS. The proposed model achieved good performance, with an AUC of 0.889, a recall of 0.855, and a specificity of 0.773. It uses only four base features, which makes the model simple and easy to interpret. In addition, we calibrated the predicted probabilities of the model. Therefore, the proposed model can provide both a MetS diagnosis and a risk prediction. We also propose a MetS risk map with which individuals can easily determine whether they have metabolic syndrome.


Introduction
Metabolic syndrome (MetS) is a chronic disease caused by obesity, high blood pressure, hyperglycemia, and dyslipidemia [1]. Although definitions differ slightly in their details, they share five common risk factors: fasting plasma glucose, blood pressure, triglycerides, high-density lipoprotein cholesterol, and waist circumference. MetS is diagnosed if three or more of these factors are abnormal [1]. MetS has emerged as a major public health concern worldwide because its prevalence among adults in many urbanized countries has steadily increased to 20-30%. Furthermore, MetS increases the risk of cardiovascular disease and type 2 diabetes [2]. The prevalence of MetS in South Korean adults was reported to be 22.9% in 2018 [3].
To improve this situation, noninvasive predictive studies have been conducted to detect and prevent MetS early. Noninvasive predictive models do not use invasive information obtained by penetrating the body or skin, such as blood tests, so continuous monitoring is simple, fast, and inexpensive. Noninvasive predictive studies have been published mainly in European and Asian countries [4][5][6][7][8][9][10][11][12]. Since 2015, many noninvasive studies have been conducted using samples of various nationalities, sizes, age groups, and prevalence (Table 1).
Most studies have been conducted on lifestyle-related and anthropometric features [4][5][6][7][9][11][13]. Gutiérrez-Esparza [13] attempted to find important features in lifestyle-related information and treated gender as an important factor when performing feature selection and model composition. However, in the final models, anthropometric features were evaluated as the main features, and although some lifestyle features were included, their roles were not significant [4,5]. The number of features used in the predictive models was between 4 and 17; more features tended to be used when lifestyle features were included, which made the models more complex.
The overall performance of the models was between 0.84 and 0.93 in terms of AUC, and most of them tended to have higher specificity than recall. Wang's study [6], which showed the best performance (AUC 0.93) using an artificial neural network, was characterized by the cumulative use of longitudinal data collected three times to increase performance. Fifteen features were used for prediction, including features of lifestyle and socioeconomic status, as well as physical features (waist circumference, age, and sex).
From an algorithmic point of view, interpretable models, such as decision trees (DTs) and logistic regression (LR), accounted for half of the previous studies presented in Table 1, and the other half used hard-to-interpret black box models, such as ensembles, artificial neural networks, and random forests. Romero-Saldaña [9,10] constructed a simple rule-based decision tree using only the waist-to-height ratio and blood pressure. The AUC was not reported, and the specificity was quite high at 0.9 or higher, while the recall was low at 0.55 and 0.78. However, these studies focused only on classification performance and did not evaluate the calibration of the predictive probabilities. Datta [8] and Wang [6] performed calibrations and achieved good results in terms of AUC. However, their predicted results are difficult to interpret because of the relatively large number of features and the high complexity of the models. In summary, predictive models from previous studies can achieve satisfactory accuracy when applied to real life, but they require many features (information) for prediction, do not provide predictive probabilities, or are difficult to interpret. A practical MetS predictive model should achieve satisfactory accuracy with minimal features and explain the prediction results such that they are understandable. Furthermore, if a predictive model predicts both the presence or absence of a disease and the risk probability, it will be more helpful for understanding health status.
In this study, we developed a practical predictive model to help prevent MetS. First, we explored the most informative features to obtain sufficient predictive performance. We developed novel synthetic features as candidate features and performed feature selection. Second, we focused on tree-based classification algorithms. We constructed models ranging from the basic DT (CART) to ensembles (Random Forest, Extreme Gradient Boosting) and deep learning-based trees (TabNet) and compared them by AUC, sensitivity, specificity, balanced accuracy, and the number of features. Third, we propose a MetS management tool that presents the predictive model visually. The outcome of the tree model is expressed as a decision structure, which has the advantage of high interpretability. We propose a visual tool (MetS risk map) for MetS prevention by reconstructing the decision structure in a more user-friendly form and adding risk probabilities.
Fig 1 depicts the process used to develop our MetS machine learning predictive model. We used health checkup records to develop the model. These records contained the core elements of anthropometry and blood test results that could identify MetS. The data also included survey results on lifestyle, diet, family history, and medical history. We extracted as many features (data attributes) as possible from these data to discover informative features for diagnosis and investigated previously known indicators. We also synthesized new anthropometric features using waist circumference, blood pressure, and diagnostic criteria and created new dietary-related features by borrowing the evaluation items of the Korean healthy eating index [14] and inflammatory index [15]. After excluding subjects with outliers and missing values, all features were arranged in a tabular dataset.

Procedure
In this tabular dataset, we divided 10% of the data into a test dataset for performance evaluation. The rest were used as a training dataset for model learning, and 10% of it was re-divided into a validation dataset, which was repeated 30 times to enable a stable performance comparison when finding the best learning model. Except for the validation and test datasets, only the training dataset was undersampled to adjust the ratio of cases with and without MetS to 1:1.
Using the training/validation sets, we built five machine learning models based on LR, CART (DT), Random Forest (RF), Extreme Gradient Boosting (XGB), and TabNet (TN). Feature selection was performed for the five models to determine the optimal features. After the feature selection, we performed parameter tuning for the five models to show as high an AUC as possible. Finally, we evaluated the prediction accuracy of the models using a test dataset. We also evaluated the practicality of each learning model based on performance metrics, such as recall, specificity, and balanced accuracy, as well as the calibration plot and Brier score, number of features, and interpretability.

Raw data
This study was based on health checkup records collected from South Korea: eight major metropolitan cities (Seoul, Incheon, Daegu, Busan, Gwangju, Ulsan, Daejeon, and Sejong) and eight other provinces (Gyeonggi-do, Gangwon-do, Gyeongsangnam-do, Chungcheongbuk-do, Chungcheongnam-do, Jeollanam-do, Jeollabuk-do, and Jeju-do). The records were obtained from the Korea Genome and Epidemiology Study conducted by the Korea Disease Control and Prevention Agency [16]. The survey collected lifestyle, medical history, dietary habits, food intake, and anthropometric and clinical measurements to identify risk factors for chronic diseases common to Koreans. The records include all the factors necessary for diagnosing MetS: waist circumference, systolic and diastolic blood pressure, fasting glucose, triglycerides, and HDL cholesterol. The survey was conducted from 2004 to 2013, and records of 173,209 adults aged 40 years or older were collected.
A total of 70,370 participants were selected from the source records and used according to the following criteria: 1) subjects aged < 70 years; 2) exclusion of subjects with the following diseases that can affect dietary habits: hypertension, diabetes, hyperlipidemia, stroke, fatty liver, angina pectoris, thyroid disease, and cancer; 3) exclusion of subjects with missing values and outliers regarding diet, blood tests, and anthropometric measurements. Table 2 summarizes the characteristics of the selected participants, focusing on the MetS factors. The Institutional Review Board (IRB) of Dankook University granted approval for the study protocol and waived the requirement for obtaining informed consent from participants (DKU 2021-06-008).

Preprocessing and feature synthesis
We used the presence or absence of MetS as a class label and defined MetS based on the criteria proposed in 2005 by the revised National Cholesterol Education Program-Adult Treatment Panel III (revised NCEP ATP III) [1]. Waist circumference for abdominal obesity followed the criteria suggested by the Korean Society for Obesity [17], which are recommended for Koreans. In summary, the diagnostic criteria for MetS in this study were as follows: 1) increased waist circumference (≥90 cm for males and ≥85 cm for females); 2) elevated blood pressure (systolic blood pressure ≥130 mmHg or diastolic blood pressure ≥85 mmHg); 3) elevated fasting blood glucose (≥100 mg/dl); 4) elevated triglycerides (≥150 mg/dl); 5) reduced HDL cholesterol (<40 mg/dl for males and <50 mg/dl for females). A measurement that met one of these criteria was defined as a risk factor, and a participant with three or more risk factors was diagnosed with MetS.
We extracted as many features (predictors) as possible to identify informative features for diagnosing MetS. We targeted only noninvasive measurement items, excluding blood test items. Table 3 summarizes the features extracted from the health checkup records. In total, 237 features were extracted and classified into three types according to their attributes: anthropometric, survey-based, and synthesized. Anthropometric features consist of body information measured by professional examination institutions and body shape-related features synthesized using this information. These synthesized features are anthropometric indices that describe body fat distribution [4]. Survey-based features contain lifestyle-related information, such as food intake, drinking, smoking, and exercise, and were collected through questionnaires [16].
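The labeling rule described above can be sketched as a short function. This is a minimal illustration, not the authors' code: the function names, argument names, and the "M"/"F" encoding of sex are hypothetical, and the thresholds are the ones listed in the diagnostic criteria of this section.

```python
def count_risk_factors(sex, waist_cm, sbp, dbp, glucose, triglycerides, hdl):
    """Count MetS risk factors; sex is "M" or "F" (hypothetical encoding)."""
    male = sex == "M"
    factors = [
        waist_cm >= (90 if male else 85),  # increased waist circumference (cm)
        sbp >= 130 or dbp >= 85,           # elevated blood pressure (mmHg)
        glucose >= 100,                    # elevated fasting glucose (mg/dl)
        triglycerides >= 150,              # elevated triglycerides (mg/dl)
        hdl < (40 if male else 50),        # reduced HDL cholesterol (mg/dl)
    ]
    return sum(factors)


def has_mets(sex, waist_cm, sbp, dbp, glucose, triglycerides, hdl):
    """Class label: MetS if three or more of the five risk factors are present."""
    n = count_risk_factors(sex, waist_cm, sbp, dbp, glucose, triglycerides, hdl)
    return n >= 3
```

Note that the label needs the three blood-test values, whereas the predictive model itself is restricted to the noninvasive inputs.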
In addition to these two types of features, we synthesized new features. Waist circumference and blood pressure, which are both noninvasive information and risk factors for MetS, were synthesized as follows: the measured value x of waist circumference or blood pressure was scaled based on the corresponding diagnostic criterion and then passed through a sigmoid-type function. In the case of blood pressure, c in the denominator was substituted with 45, which is the difference between systolic and diastolic blood pressure. In addition, the final synthetic feature of blood pressure was taken as the higher of the systolic and diastolic synthetic features. The resulting features WC and BP, which are the building blocks for the other synthetic features, have the following properties: 1) the value ranges between 0 and 1; 2) a value of 0.5 means that the original measurement equals the diagnostic criterion; 3) as the value approaches 1, the measurement significantly exceeds the diagnostic criterion; 4) the value reflects changes near the diagnostic criterion more sensitively. Table 4 summarizes the details of the features, and further details can be found in our previous study [18]. Moreover, additional features were created by combining these basic unit features (BP and WC) again and by calculating lifestyle-related features according to each measurement item of the Korean healthy eating index [14] and the dietary inflammatory index [15] (Tables 3 and 4). Synthetic features are marked "Yes" in the Synthetic column in Table 3, and their names are capitalized to distinguish them from the raw features. In this study, "raw features" refer to features extracted from a single piece of information, such as height, weight, and waist circumference, while synthetic features refer to new features made using two or more raw features.

Prepare train/validation/test sets
We divided the dataset into training and test datasets at a ratio of 9:1. We then re-split the training dataset at a 9:1 ratio and used the smaller portion as a validation dataset. Because the classes were imbalanced (the prevalence of MetS was 13.6%), the ratio of MetS to non-MetS cases was adjusted to 1:1 using the undersampling method. Undersampling was performed only on the training dataset, and the original MetS prevalence was maintained in the validation and test datasets. The process of separating the validation dataset from the training dataset was repeated 30 times to construct datasets containing various possible non-MetS cases (Table 5).
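The splitting scheme described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' pipeline: the helper names, the toy records, the random seed, and the use of simple (non-stratified) random splits are all hypothetical choices.

```python
import random


def split(rows, holdout_frac, rng):
    """Shuffle rows and return (remaining, holdout)."""
    rows = list(rows)
    rng.shuffle(rows)
    n_holdout = round(len(rows) * holdout_frac)
    return rows[n_holdout:], rows[:n_holdout]


def undersample(rows, rng):
    """Randomly drop majority-class rows until both classes have equal size."""
    pos = [r for r in rows if r["mets"] == 1]
    neg = [r for r in rows if r["mets"] == 0]
    k = min(len(pos), len(neg))
    return rng.sample(pos, k) + rng.sample(neg, k)


rng = random.Random(0)
# Toy records with the prevalence reported in the text (13.6%).
data = [{"id": i, "mets": int(i < 136)} for i in range(1000)]

train_full, test = split(data, 0.1, rng)          # 9:1 train/test
folds = []
for _ in range(30):                               # 30 repeated re-splits
    train, val = split(train_full, 0.1, rng)      # 9:1 train/validation
    folds.append((undersample(train, rng), val))  # balance the training part only
```

Each of the 30 folds draws a different random subset of non-MetS cases, while the validation and test parts keep the original class ratio.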

Build candidate models
Feature selection. In the candidate model-building phase, feature selection and parameter tuning were performed for the five classification algorithms. We aimed to retain no more than 10 final features so that the diagnostic models would be practical to use. Among the three types of features classified in Table 3, the most informative features were selected over three rounds, as shown in Fig 2, and each round applied the feature selection process shown in Fig 3. The three rounds proceeded as follows: Round 1 selects no more than 10 best features from each of the anthropometric and survey-based feature types. Round 2 combines the features selected in Round 1 and then selects no more than 10 best features from the combined set. Finally, Round 3 selects no more than 10 best features from the set that combines the features selected in Round 2 with our proposed synthetic features.
Within each round, the selection process consisted of three steps. In the first step, we evaluated the feature importance and selected the top 30 or fewer features. Feature importance was evaluated using the method provided by each classifier; for LR, it was based on the exponential conversion of the coefficients. In the second step, we reduced the features obtained in the first step to at most 10 using the recursive feature elimination (RFE) method. RFE is a wrapper feature selection method, that is, a model-dependent method based on the evaluation of the learning model used. In this process, we used the RFE and RFECV functions of sklearn with the AUC as the evaluation criterion. In the third step, we selected the best-performing feature set from all possible combinations of the (up to) 10 remaining features. We constructed a model using each feature combination and selected the feature set of the model with the highest AUC.
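The final step, an exhaustive search over feature combinations, can be sketched as follows. This is only an illustration: `best_subset` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for the validation AUC of a model trained on the given subset (its gain values are invented, not results from the study).

```python
from itertools import combinations


def best_subset(features, score):
    """Return the feature subset with the highest score, and that score."""
    best, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, best_score


# Hypothetical per-feature gains; in practice, score() would train a model
# on the subset and return its AUC on the validation dataset.
GAIN = {"WC": 0.005, "BP": 0.005, "BPWC_add": 0.15, "BPWC_mul": 0.15, "BPWC_dif": 0.08}


def toy_score(subset):
    return 0.5 + sum(GAIN[f] for f in subset) - 0.01 * len(subset)


features = ["WC", "BP", "BPWC_add", "BPWC_mul", "BPWC_dif"]
best, best_score = best_subset(features, toy_score)
```

With at most 10 candidate features there are at most 1,023 non-empty subsets, so an exhaustive search stays tractable after RFE has pruned the candidate list.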
Classification algorithms. The selection criteria for the five models are as follows. First, LR was chosen as a benchmark model for performance comparison. LR, a conventional algorithm alongside DT, is renowned for its interpretability. DT and TN were chosen based on their interpretive characteristics [19][20][21]. A DT is easily comprehensible by non-experts if it has a reasonable number of nodes. TN, a recent deep learning model, can generate feature maps through a learnable mask, representing features highly correlated with prediction results. It was postulated that TN's interpretability could render it the most appropriate option if it exhibits a discernible difference in performance compared to existing models. RF and XGB were selected for performance evaluation because the ensemble model RF and the boosting family XGB are known to be the highest performing models on tabular data [22][23][24]. Consequently, the performance of these models served as an upper limit to gauge the positioning of the interpretable models DT and TN.
The rationale for selecting DT-series algorithms lies in their inherent flexibility, which can be attributed to two primary characteristics [25]. First, decision trees are nonparametric methods, which means that they are not constrained by any assumptions about the distribution of the data. Second, decision trees are not distance-based models, so they require neither normalization nor scale conversion and are robust to the presence of outliers.
TN is a novel deep neural network (DNN) architecture that utilizes a decision-tree-based approach to handle tabular data [21]. TN is capable of 1) processing raw data without any preprocessing, 2) selecting features in an instance-wise manner using sequential attention, and 3) mimicking an ensemble by sequentially repeating DNN blocks called "Steps" [21]. The key element of TN is the learnable mask used for feature selection, which enables the implementation of output manifolds similar to those of decision trees. The TN architecture is built by repeating the Step building block, where each Step receives attention information from the previous Step, learns the mask, selects the features, and outputs the results.
Parameter tuning. Parameters were tuned using the ParameterGrid class of the model_selection module provided by sklearn. First, the parameters were preset for each classifier according to the preset parameter column in Table 6 to generate the basic models for parameter tuning. All other parameters used default values. Then, as shown in the grid parameter column in Table 6, the optimal combination was determined by varying the values of the main parameters for each classifier. A grid search was conducted using AUC as the evaluation criterion.

Model calibration
Calibration is an essential component of predictive model evaluation for medical decision-making, diagnosis, and prognosis [26]. Calibration is a measurement of how well the predicted probability of an event matches the true underlying probability of the event [27]. In practice, good calibration means that an event predicted with a probability of 0.9 actually occurs with a probability of 0.9 [27]. Clinically, the probability of occurrence can also be interpreted as a risk that has practical significance. In the decision-making process, it is more useful to refer to continuous values for risks, such as probability, than to broad classifications, such as the presence or absence of MetS [27]. Therefore, it is important that our models are well-calibrated in addition to having good discrimination.
We used three methods to calibrate the predictive probability of a diagnosis model: Platt scaling, isotonic regression, and Pozzolo's calibration [28,29]. These methods are designed for binary classification and require the use of an independent calibration set to obtain good calibration probabilities [29].
Platt scaling, originally proposed for calibrating SVM predictions, is most effective when the predicted probabilities are distorted in a sigmoid shape [29]. The calibrated probability is obtained by passing the output f(x) of the diagnostic model through the sigmoid function:
$$P(y = 1 \mid f) = \frac{1}{1 + \exp(A f + B)}$$
Parameters A and B are fitted using the maximum likelihood estimation method from the fitting set (f_i, y_i), and gradient descent is used to find the following solution [29]:
$$\underset{A, B}{\arg\min} \left\{ -\sum_{i} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \right\}, \quad p_i = \frac{1}{1 + \exp(A f_i + B)}$$
Isotonic regression is a more general method, with the only restriction being that the mapping function is monotonically increasing (isotonic) [29]. The basic assumption in isotonic regression is:
$$y_i = m(f_i) + \epsilon_i$$
where m is an isotonic function, f_i is the prediction from the model, and y_i is the class label. Then, given a fitting set (f_i, y_i), the isotonic regression problem is to find the isotonic function \hat{m} such that
$$\hat{m} = \underset{z}{\arg\min} \sum_{i} \left( y_i - z(f_i) \right)^2$$
Isotonic regression has the advantage of being able to correct any monotonic distortion, but it is prone to overfitting when data are scarce [29].
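As an illustration of Platt scaling, the sketch below fits A and B by plain gradient descent on the negative log-likelihood, using the sign convention P(y = 1 | f) = 1 / (1 + exp(A f + B)). It is a minimal standard-library sketch, not the authors' implementation: the function names, learning rate, iteration count, and toy scores are hypothetical.

```python
import math


def platt_prob(f, a, b):
    """Calibrated probability 1 / (1 + exp(a*f + b))."""
    return 1.0 / (1.0 + math.exp(a * f + b))


def fit_platt(scores, labels, lr=0.5, n_iter=5000):
    """Fit a and b by gradient descent on the mean negative log-likelihood."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for f, y in zip(scores, labels):
            p = platt_prob(f, a, b)
            grad_a += (y - p) * f  # d(NLL)/da under this sign convention
            grad_b += (y - p)      # d(NLL)/db
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy fitting set: uncalibrated scores and true class labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]
a, b = fit_platt(scores, labels)
```

When higher scores go with more positives, the fitted A is negative, so the calibrated probability increases with the score.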
Pozzolo's method corrects the predictive probability of a model trained on undersampled data [28]. Undersampling results in a mismatch between the distributions of the training and test sets. In other words, the learning model is based on the distribution of the undersampled training set, but the test set used in the evaluation follows the distribution before undersampling. Therefore, it is necessary to adjust the predictive probabilities of the learning model for the bias caused by the difference between these two distributions [28]. The bias-corrected probability p' is obtained using the following equation:
$$p' = \frac{\beta p_s}{\beta p_s - p_s + 1}$$
where β is the probability of selecting a negative instance during undersampling from all negative instances and p_s is the predictive probability of a model trained on the undersampled dataset. The advantage of Pozzolo's method is that the optimal threshold can be calculated in a simple way based on mathematical theory and that no additional fitting set is required for calibration. The optimal threshold equals the probability of selecting a positive from the entire dataset.
Calibration is typically measured over a set of predictions and not for an individual prediction. It is impossible to directly measure the true underlying probability of a one-time event because the event either occurs or does not occur [27]. The Brier score is a typical measure; it is the mean squared error between the actual and predicted probabilities for a set of predictions [26]. Given a set of n predictions \hat{p} with true probabilities p, the Brier score is
$$BS = \frac{1}{n} \sum_{i=1}^{n} \left( p_i - \hat{p}_i \right)^2$$
A lower score indicates better accuracy, but no criterion for a "good" score has been established [27]. Therefore, another measurement is required to determine whether the calibration is significant, such as the Spiegelhalter z-test [26]. This method provides a criterion for determining the significance of calibration by decomposing the Brier score.
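The undersampling correction can be illustrated in a few lines. This is a sketch under stated assumptions: the function name is hypothetical, and β is taken to be the ratio of positives to negatives, which holds when negatives are sampled down to a 1:1 class ratio as described for the training data.

```python
def correct_undersampling(p_s, beta):
    """Map a probability from an undersampled model back to the original prior."""
    return beta * p_s / (beta * p_s - p_s + 1.0)


prevalence = 0.136                      # MetS prevalence reported in the text
# Assumption: negatives were undersampled to a 1:1 ratio, so the probability
# of keeping a negative is (number of positives) / (number of negatives).
beta = prevalence / (1.0 - prevalence)
p_at_half = correct_undersampling(0.5, beta)
```

Under these assumptions, an uncorrected score of 0.5 maps back to the prevalence itself, which agrees with the statement that the optimal threshold equals the probability of selecting a positive from the entire dataset.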
Spiegelhalter's z-test is defined as
$$z = \frac{\sum_{i=1}^{n} (y_i - \hat{p}_i)(1 - 2\hat{p}_i)}{\sqrt{\sum_{i=1}^{n} (1 - 2\hat{p}_i)^2 \, \hat{p}_i (1 - \hat{p}_i)}}$$
where y_i is the ith true class label and \hat{p}_i is the ith predicted probability. Statistically significant scores (i.e., z < -1.96 or z > 1.96) generally indicate poor calibration because the values of z asymptotically follow a standard normal distribution, and the null hypothesis is that the model is well-calibrated [27].
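Both summary statistics are simple to compute. The sketch below is illustrative only: the function names and the two-point toy example are hypothetical, and the observed class labels stand in for the true probabilities in the Brier score.

```python
import math


def brier_score(y, p_hat):
    """Mean squared difference between outcomes and predicted probabilities."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p_hat)) / len(y)


def spiegelhalter_z(y, p_hat):
    """Spiegelhalter z statistic; |z| > 1.96 suggests poor calibration."""
    num = sum((yi - pi) * (1 - 2 * pi) for yi, pi in zip(y, p_hat))
    var = sum((1 - 2 * pi) ** 2 * pi * (1 - pi) for pi in p_hat)
    # var is zero if every prediction is exactly 0, 0.5, or 1; not handled here.
    return num / math.sqrt(var)


y = [0, 1]
p_hat = [0.1, 0.9]
bs = brier_score(y, p_hat)
z = spiegelhalter_z(y, p_hat)
```

For this toy example, the Brier score is 0.01 and z is about -0.47, which lies inside the ±1.96 band, so the null hypothesis of good calibration would not be rejected.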
Calibration can also be evaluated graphically to compensate for the limitations of summary statistics, such as the Brier score and Spiegelhalter's z-test statistic. A calibration plot, also called a reliability plot, is a graph that connects points whose x-coordinate is the predicted probability and whose y-coordinate is the corresponding actual probability. The plot includes a diagonal line that represents perfect calibration. The advantage of the calibration plot is that miscalibration patterns can be easily identified [27].

Model comparison metrics
In medical research, the AUC is widely used for evaluating discrimination [26,27,30]. The ROC plot depicts the trade-off between recall and specificity. In the plot, the x-axis denotes 1 - specificity (the false-positive rate), the y-axis denotes recall, and the AUC is the area under the ROC curve.
Recall and specificity are two components that measure the validity of diagnostic models with dichotomous predictions [30]. Comparing the predicted diagnosis with the actual health status yields four cases: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). TP is a case in which a patient with the disease is predicted to be positive. FP is a case in which a patient without the disease is predicted to be positive. TN is a case in which a patient without the disease is predicted to be negative. FN is a case in which a patient with the disease is predicted to be negative. The recall of a diagnosis refers to the ability of the model to correctly identify patients with the disease, whereas the specificity of a diagnosis refers to the ability of the model to correctly identify patients without the disease:
$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$
The AUC is also robust to prevalence because recall and specificity are not affected by the prevalence of the disease [30]. In addition, the AUC takes values between 0 and 1 because both axes of the ROC plot range from 0 to 1; the AUC of a purely random model is 0.5, and the AUC of a perfect model is 1 [27]. We also used balanced accuracy to evaluate discrimination, as it is robust to prevalence.
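These metrics follow directly from the four confusion-matrix counts. The sketch below uses hypothetical function names and made-up counts; balanced accuracy is taken in its usual form as the mean of recall and specificity.

```python
def recall(tp, fn):
    """Share of patients with the disease who are predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of patients without the disease who are predicted negative."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of recall and specificity; insensitive to class prevalence."""
    return (recall(tp, fn) + specificity(tn, fp)) / 2


# Made-up counts for illustration: 100 diseased and 100 healthy subjects.
rec = recall(tp=85, fn=15)
spec = specificity(tn=77, fp=23)
bal_acc = balanced_accuracy(tp=85, fp=23, tn=77, fn=15)
```

Because each metric is computed within one true class, changing the ratio of diseased to healthy subjects leaves all three values unchanged.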

Feature selection
Selected features and their importance. Seventeen features were selected across the five classifiers, as listed in Table 7. There were eight anthropometric features, six of which were our proposed synthetic features. The proposed features are WC, BP, BPWC_add, BPWC_mul, BPWC_dif, and bWC, and the features from previous studies are CUN-BAE and WHR. These synthetic features were composed of raw features, as listed in Table 8. From the raw feature point of view, the proposed synthetic features were mainly based on waist circumference, systolic and diastolic blood pressure, and sex, and the features proposed in previous studies consisted of age, sex, weight, height, waist circumference, and hip circumference. The selected lifestyle-related features were carbohydrate energy, fat energy, grain, retinol, kimchi, green vegetables, leaf tea, lettuce, and non-smoker (see Table 9 for the meaning of each feature). Most were related to food or nutrient intake, and only the current smoking status was related to other aspects of lifestyle.
In all classifiers, our proposed synthetic features were not only selected as important features but also ranked high (Fig 4). Specifically, synthetic features composed of WC, BP, and variations of these two were selected. Existing synthetic features and lifestyle-related features followed our synthetic features: CUN-BAE, WHR, carbohydrate energy, non-smoker, grain, retinol, kimchi, fat energy, green vegetables, leaf tea, and lettuce (see Tables 8 and 9 for the meaning of these features).
Based on the classification model, the total number of features was DT (3) < TN (4) < LR (5) < XGB (6) < RF (8), in ascending order. Based on the number of raw features, the order [...]

Table 7 notes. N: the number of features. Proposed synthetic features are marked with an asterisk before the feature name. Raw features refer to features of a single piece of information, such as height, weight, and waist circumference, and synthetic features are created using these raw features. https://doi.org/10.1371/journal.pone.0286635.t007

Candidate models
Model parameters. Table 10 lists the results of the parameter tuning for each classifier. These parameters were used to build the final candidate model for each classifier. Table 7 lists the final selected features used by each candidate model.
Performance comparison. Table 11 summarizes the performances of the candidate models. Except for LR, the performance of all candidate models improved after parameter tuning. The DT showed the most noticeable improvement among the models, with its AUC rising from 0.792 to 0.886. Comparing the models with optimized parameters based on AUC, [...]
Table 12 and Fig 6 show the results of the evaluation using a test dataset after applying calibrations to the optimal models. First, as shown in Fig 6, when calibration was not applied, the predicted probability in all classification models was overestimated compared to the actual probability. Although the calibration methods differed somewhat, the predicted probabilities were well corrected overall once calibration was applied. As shown in Table 12, the Brier score [...]

Comparison of candidate models
The characteristics of the calibrated models based on the analysis thus far are summarized in Table 13. The models were compared in terms of four aspects: discrimination, calibration, ease of use of features, and interpretability. All the models were evaluated using the same test dataset. From the perspective of discrimination, XGB exhibited the highest AUC at 0.896, followed by RF and TN (0.893), LR (0.89), and DT (0.889). However, the gap between the best and worst performances was only 0.007, so the difference in discrimination between the models was not noticeable. Recall, specificity, and balanced accuracy showed patterns similar to that of the AUC. In terms of calibration, all models also performed well, with no notable differences between them.
When comparing the number of raw features required for prediction, the DT required the fewest with four, followed by XGB, TN, LR, and RF. The top three models by performance, XGB, RF, and TN, used more than eight raw features. The DT required fewer than half as many raw features as the other prediction models. Furthermore, unlike LR, DTs have the convenience of not requiring preprocessing, such as scaling, when using features.
Each model was calibrated using a different calibration method, and the results were satisfactory when evaluated using the Brier score and the Spiegelhalter z-score and p-value. Thus, in all models, the predictive probability can be interpreted as the actual probability, that is, the risk of developing MetS. In addition to the interpretability of predictive probabilities, LR, DTs, and TN are characterized by easy interpretation of the model itself. On the other hand, the predictions of RF and XGB are difficult to interpret because these models are ensembles of numerous trees.

Decision of final MetS predictive model
The five classification algorithms produced prediction models with similar performances. In this case, the simpler the model, the better. Therefore, the number of features used was the criterion for model selection. The decision tree used the fewest raw features (systolic and diastolic blood pressures, waist circumference, and sex) compared to the other models, and these features were also easy to collect. Furthermore, DTs have several advantages: first, unlike LR, they do not need assumptions about the data; second, features can be used directly as predictors without preprocessing such as scaling; third, they are easy to interpret because the model itself structurally internalizes the decision process. RF, XGB, and TN are tree-based models that can be used without preprocessing and have nonparametric model properties, but RF and XGB are difficult to interpret as ensemble models. TN, on the other hand, can interpret the prediction results by instance, but it is not as intuitive as a DT. Therefore, considering discrimination, calibration, ease of use of features, and interpretability comprehensively, we determined that the DT is a practical model with many advantages beyond performance.
Fig 7 shows an example of how the final selected model works. When the user provides systolic blood pressure, diastolic blood pressure, waist circumference, and sex as input values, these four pieces of information become raw features and are converted into the synthetic features BP and WC. BP and WC are then combined once more to produce three synthetic features, BPWC_add, BPWC_mul, and BPWC_dif, which are used as the final input values for the prediction model. The model then outputs whether the user has MetS, the probability of MetS, and how many times higher this probability is than the average.
For example, if a woman has a systolic blood pressure of 140 mmHg, a diastolic blood pressure of 90 mmHg, and a waist circumference of 89 cm, these measurements are first converted to 0.84 (BP) and 0.66 (WC). These values are then converted to 1.50 (BPWC_add), 0.55 (BPWC_mul), and 0.18 (BPWC_dif) and input into the model. Based on these inputs, the model diagnoses MetS with a probability (risk) of 0.31 and reports that this probability is 2.25 times the average.
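The second synthesis step in this example can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `combine_bp_wc` is hypothetical, the three formulas are the ones given in the risk map section (BPWC_add = BP + WC, BPWC_mul = BP * WC, BPWC_dif = BP - WC), and the first step (converting raw measurements into BP and WC) is taken as given because its formula is not stated in this section.

```python
def combine_bp_wc(bp: float, wc: float) -> dict:
    """Combine the synthetic features BP and WC (each between 0 and 1)
    into the three final model inputs named in the text."""
    return {
        "BPWC_add": bp + wc,
        "BPWC_mul": bp * wc,
        "BPWC_dif": bp - wc,
    }


# Worked example from the text: BP = 0.84 and WC = 0.66.
features = combine_bp_wc(0.84, 0.66)
rounded = {name: round(value, 2) for name, value in features.items()}
print(rounded)
```

Rounded to two decimals, the result agrees with the values 1.50, 0.55, and 0.18 quoted in the example.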

Decision tree and metabolic syndrome risk map
We devised a "MetS risk map" with WC and BP as axes by interpreting the structured results of the decision-making process. The decision tree outputs its structured decision process in the form of a plot (A) or text (B), as shown in Fig 8. We decomposed the classification rules for each node, as shown in Fig 8B, and expressed them on a plane with WC and BP as axes. This was possible because the DT model used only three features, each defined by a relationship between WC and BP: BPWC_mul = BP * WC, BPWC_add = BP + WC, and BPWC_dif = BP - WC. DTs normally divide the space using vertical or horizontal lines; however, using the relationships among these features, we were able to divide the space with diagonal lines and curves. Each divided region of the MetS risk map corresponds one-to-one to a terminal node of the DT. The MetS risk map was completed by assigning to each region the risk of MetS derived from the calibrated probability (Fig 9). Risk is the calibrated probability divided by an adjusted threshold. The threshold was adjusted using Pozzolo's method [28] and was found to be 0.137, similar to the prevalence of MetS in the population. Therefore, the risk can be interpreted as the ratio of the predicted probability to the prevalence of MetS in the entire population.
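The risk definition above is a simple ratio, which the following sketch makes explicit. It is an illustration under stated assumptions: the function name `relative_risk` and the input validation are hypothetical, and only the threshold value 0.137 comes from the text.

```python
ADJUSTED_THRESHOLD = 0.137  # adjusted threshold reported in the text


def relative_risk(calibrated_probability: float,
                  threshold: float = ADJUSTED_THRESHOLD) -> float:
    """Return risk as the calibrated probability divided by the threshold.

    A value of 1 means the predicted probability equals the threshold,
    which the text reports to be close to the population prevalence.
    """
    if not 0.0 <= calibrated_probability <= 1.0:
        raise ValueError("calibrated_probability must lie in [0, 1]")
    return calibrated_probability / threshold


print(round(relative_risk(0.137), 2))  # probability equal to the threshold
print(round(relative_risk(0.31), 2))   # rounded probability from the worked example
```

With the rounded probability of 0.31 from the worked example, this ratio comes out at about 2.26, close to the reported 2.25; the small difference is presumably due to rounding of the published probability.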
The MetS risk map was divided into three zones. These zones are formed by two lines, as shown in Fig 9B. The first zone is the area below the line BP + WC = 0.66 and is a safety zone for MetS. Most regions in it were classified as non-MetS, and the risk of development was much lower than 1. The second zone is the area above the curve BP × WC = 0.31 and is a risk zone for MetS. All regions in it were classified as having MetS, and the risk of development was > 2. The third zone is the area between the two lines and is a warning zone. In this zone, the condition gradually progresses to MetS, which makes it the most important zone for prevention. More specifically, the important region for prevention can be narrowed down to the gray zone in Fig 9B. The gray zone is the region where the risk increases rapidly compared to the adjacent non-MetS regions. At the same time, MetS and non-MetS cases existed at similar rates in this area. Combining these two facts, we can infer that the gray zone is where conversion to MetS actively takes place.
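The three-zone rule can be written as a short lookup. This is a sketch, not the decision tree itself: the function name `mets_zone` is hypothetical, the handling of points lying exactly on a boundary is an assumption, and the gray zone is omitted because its boundaries are given only graphically in Fig 9B.

```python
SAFETY_LINE_SUM = 0.66      # boundary BP + WC = 0.66 from the text
RISK_CURVE_PRODUCT = 0.31   # boundary BP x WC = 0.31 from the text


def mets_zone(bp: float, wc: float) -> str:
    """Place a (BP, WC) point in the safety, warning, or risk zone."""
    if bp + wc < SAFETY_LINE_SUM:
        return "safety"   # below the diagonal line
    if bp * wc > RISK_CURVE_PRODUCT:
        return "risk"     # above the curve
    return "warning"      # between the two boundaries


print(mets_zone(0.20, 0.30))  # sum 0.50 is below 0.66
print(mets_zone(0.50, 0.50))  # sum 1.00, product 0.25
print(mets_zone(0.84, 0.66))  # worked example: product is about 0.55
```

For values between 0 and 1, any point above the curve BP × WC = 0.31 also lies above the line BP + WC = 0.66, so the order of the two checks does not create overlapping zones.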

Summary of results
In the results section, we describe various aspects of the developed MetS prediction models. The summary is as follows:
➀ Synthetic features based on BP and WC were evaluated as the most important by all classifiers. These synthetic features use only waist circumference, systolic and diastolic blood pressure, and sex as base features, which all classifiers included in common.
➁ In the analysis of the predicted probability of the models, we found a tendency to overestimate MetS in all classifiers and calibrated the models to reduce the estimation error. Therefore, the probability predicted by the calibrated model was indicative of the risk of developing MetS.
➂ We selected the DT model as the final predictive model for MetS. It used the fewest features for prediction but achieved performance nearly equal to that of the other models. Four raw features, namely waist circumference, systolic and diastolic blood pressure, and sex, were used, which have the advantage of being easily measured in daily life. The decision tree model is simple and transparent, so its decision structure can be readily understood.
➃ We devised a MetS risk map by reconstructing the decision structure of the final model as a two-dimensional plane and mapping the risk probability to each region.

Discussion
We developed a predictive model for MetS that utilizes only noninvasive information, making it practical for use in real-world scenarios. While fasting blood sugar, triglycerides, and HDL cholesterol are important factors in diagnosing MetS, we deliberately excluded features that require blood testing when developing our predictive model, to ensure its preventive usability.
The proposed model has three major advantages for the preventive management of MetS. First, prediction requires only four easily measurable features: waist circumference, systolic and diastolic blood pressure, and sex. Second, the predictive model provides the degree of risk along with the diagnosis of MetS, enabling individuals to engage effectively in preventive management. Third, the prediction results can be easily understood by individuals, and the prediction model can be provided as a visual tool reconstructed in a simple map form. These three advantages are also consistent with the technology acceptance model (TAM), a theory of the properties that lead an information technology to be well received in society. According to TAM, the higher the perceived usefulness and perceived ease of use of a technology, the higher its acceptability [31]. Perceived usefulness relates to usefulness and productivity for the task, and perceived ease of use is reflected in qualities such as being clear, understandable, and requiring little mental effort [32]. Perceived usefulness is also affected by perceived ease of use; that is, a technology is perceived as more useful when it is easy to use [33].
Using five classification algorithms, we identified 17 noninvasive raw features useful for predicting MetS (Fig 5). Central among them, systolic and diastolic blood pressure, waist circumference, and sex are directly related to the MetS diagnostic criteria. We synthesized these four key features to create the BP and WC features. These novel features, including various variants using BP and WC, were of higher importance in predicting MetS than synthetic features proposed in previous studies, such as CUN-BAE, BRI, and BMI. We presume that this result is due to the inherent properties of BP and WC. As shown in Fig 10, the synthetic features give greater weight to a certain section around the diagnostic criteria. The two axes of the MetS risk map, WC and BP, can be interpreted as the distance of the corresponding MetS factors from their diagnostic criteria. To be exact, BP and WC weight the range within about 10% of the diagnostic criterion more heavily and have a sigmoid-like nonlinearity, expressed as a value between 0 and 1. The actual diagnostic criterion corresponds to a value of 0.5 for both BP and WC. Therefore, a value approaching 0.5 is close to the diagnostic criterion, and a value above 0.5 exceeds it. Furthermore, the 0.25-0.75 interval, which spans half of the range between 0 and 1, corresponds to the range from 10% below to 10% above the diagnostic criterion. That is, BP is the weighted position of blood pressure with respect to its diagnostic criterion, and WC is the weighted position of waist circumference with respect to its diagnostic criterion. This property fits well with the perspective of preventing chronic diseases: for prevention, it is more effective to focus on a section with greater risk than to treat all ranges as equally important.
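To make the stated properties concrete, the sketch below shows one curve that satisfies them: a logistic function of the relative distance from the diagnostic criterion, with its steepness chosen so that the criterion maps to 0.5 and values 10% below and above it map to 0.25 and 0.75. This is an assumption made purely for illustration. The authors' actual transformation is not given in this section and may differ, so this curve should not be expected to reproduce the BP and WC values quoted in the paper's examples. The names `position_score` and `STEEPNESS` are hypothetical.

```python
import math

# Steepness chosen so that +10% maps to 0.75 and -10% maps to 0.25:
# 1 / (1 + exp(-k * 0.1)) = 0.75  =>  k = 10 * ln(3)
STEEPNESS = 10 * math.log(3)


def position_score(value: float, criterion: float) -> float:
    """Map a measurement to (0, 1) by its relative distance from a criterion.

    Assumed logistic form (illustrative only, not the paper's formula).
    """
    relative_distance = value / criterion - 1.0
    return 1.0 / (1.0 + math.exp(-STEEPNESS * relative_distance))


# Hypothetical criterion of 100 units, used only to show the anchor points.
for measurement in (90, 100, 110):
    print(measurement, round(position_score(measurement, 100), 2))
```

Under this assumed form, half of the output range is spent on the band within 10% of the criterion, which is the weighting behavior the paragraph describes.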
Lifestyle-related features were not evaluated to be as important as anthropometric features. This result was also reported in previous studies [4][5][6][7][8][9][10][11][12][13]. However, given that many studies have reported an association between lifestyle and MetS, we speculate that the way lifestyle-related information was collected was not sufficient to reveal its characteristics. In fact, Tabares et al. [34] recently reported that increasing physical activity levels and lowering BMI by at least 2% reduced the risk of developing MetS by 3.8% but added that increasing physical activity without weight loss had little effect on prediction. This finding indicates that lifestyle influences become observable when accompanied by meaningful physical changes. Therefore, it is necessary to examine whether the lifestyle data used in this study contain sufficient information accompanying physical changes. In the case of the dietary data used in this study, the frequency of food intake was collected through questionnaires in which respondents answered in monthly, weekly, or daily units while recalling their "average frequency of intake over the past year" [35]. If such data were collected several times at intervals shorter than a year, we expect that the results would differ from the current ones. In fact, a study using three follow-up datasets reported that performance was improved by using cumulative survey data [6]. In addition, precise and dense lifestyle data are expected to accumulate as healthcare technologies such as smartwatches, smart bands, and diet management apps become popular. Therefore, future studies are needed to identify important lifestyle features for MetS prediction based on these data.
The DT finally achieved an AUC of 0.889, recall of 0.855, and specificity of 0.773. A direct comparison with previous studies is difficult because of differences in race, population size, and prevalence of MetS, but the performance is similar in terms of AUC (see Table 1). However, our model was characterized by higher recall than specificity. As a result, individuals without MetS are more likely to be classified as having MetS than with previous models, but from the standpoint of preventive management, it is appropriate to diagnose suspected patients conservatively and to encourage additional checkups. In addition, among studies on MetS prediction models for Koreans over the past decade [36], this study is, to the best of our knowledge, the first based on noninvasive information from a large-scale Korean population.
Of the studies listed in Table 14, two ([9,13]) identified gender-specific differences in model performance when considering prevalence-robust metrics. Specifically, the male models outperformed the female models in both studies, with a balanced accuracy of 0.807 for men and 0.646 for women in [13] and a recall of 0.594 for men and 0.409 for women in [9]. While key anthropometric features were consistent across genders, variations were observed in food-related features. Therefore, we developed the final decision tree (DT) model separately for men and women and compared their performance (Table 14): the balanced accuracy was 0.872 for men and 0.890 for women, and the recall was 0.843 for men and 0.850 for women. Both gender-specific models shared common features such as BPWC_add and BPWC_mul (Fig 11), while the female model included additional anthropometric and dairy-related food features. However, when tested on the same dataset, the integrated model performed better for men based on AUC, while no significant difference was observed for women. Consequently, we conclude that building separate models by gender had no significant effect on feature selection or performance.
From a positive predictive value (PPV) perspective, the risk map can be divided into three distinct areas: green, yellow, and red (Fig 12A). The green zone refers to a section that is not associated with MetS, where the risk of developing metabolic syndrome is 0 to 1 times the average prevalence of 13.7%, that is, at or below it. The yellow zone refers to a section in which the … Table 15 summarizes the PPV for each gender in the different zones. The Yellow Zone has a relatively low PPV for both males (0.258) and females (0.241), implying that only about one out of four individuals predicted to have metabolic syndrome is likely to have it. In contrast, the Red Zone has a higher PPV for both males (0.725) and females (0.633), with males having the higher value. These findings suggest that 6 to 7 out of 10 predicted individuals in this zone are highly likely to have metabolic syndrome. However, the use of binary classification diagnostic criteria to calculate PPV is inadequate for the prevention of metabolic syndrome, which is a chronic disease that develops progressively over time. Our focus, therefore, should be on providing individuals with opportunities to manage the condition before it progresses to a more severe stage. Thus, in this study, we analyzed false positives (FP) from the perspective of severity, rather than solely the presence of metabolic syndrome. To achieve this, we employed the risk index proposed in our previous work [18] and obtained a risk distribution for the FP cases. The resulting risk distribution is presented in Fig 12B and 12C. Our risk index employs a diagnostic threshold of 0.547 for identifying MetS [18]. However, MetS may also occur at values as low as 0.45, leading us to classify individuals with a score of 0.45 or higher as belonging to the MetS risk group. Using this criterion, we identified 552 of the 1176 FP cases in the Yellow Zone as true positives, resulting in an adjusted PPV of 0.582, up from 0.248. Similarly, in the Red Zone, 123 FP cases were reclassified as true positives, resulting in an adjusted PPV of 0.686, up from 0.673. Nevertheless, to improve PPV significantly, further research is needed, for example on building separate models for each zone or for regions with many misclassifications.
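The adjustment described above moves some false positives into the true-positive count while the total number of positive predictions stays the same. The sketch below illustrates that arithmetic. The function names are hypothetical and the counts in the example are made up for illustration, since the per-zone true-positive counts are not reported in this section; the code is not intended to reproduce the published PPV values.

```python
def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


def adjusted_ppv(true_positives: int, false_positives: int,
                 reclassified: int) -> float:
    """PPV after counting `reclassified` former false positives as true positives."""
    if not 0 <= reclassified <= false_positives:
        raise ValueError("reclassified must be between 0 and false_positives")
    return ppv(true_positives + reclassified, false_positives - reclassified)


# Illustrative counts only (not taken from the paper).
print(ppv(30, 70))               # 30 of 100 positive predictions are correct
print(adjusted_ppv(30, 70, 20))  # 20 false positives counted as at-risk cases
```

Because the denominator is unchanged, each reclassified case raises the PPV by one over the number of positive predictions in the zone.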
Although the low PPV is a limitation, our final model can effectively support decision-making in the prevention and management of MetS by providing the risk of developing MetS along with good discrimination and recall. Previous studies have focused on the diagnosis itself and have overlooked the evaluation of the prediction probability. Some studies calibrated the predicted probability but did not present a theoretically grounded threshold for distinguishing the presence or absence of MetS after calibration. In contrast, this study expands the interpretation of the results by calibrating the overestimated prediction probability using the method proposed by Pozzolo [28]. In addition to allowing the predicted probability to be interpreted as MetS risk, this made it possible to express how serious a person's state is relative to a theoretically clear threshold.
In addition to semantic interpretations of predictive probabilities, such as risk, we devised a way to explain the rationale for predictive outcomes. MetS risk maps are designed to create synergy by bringing together in one place the structural interpretability of the DT, the meaning of the proposed synthetic features (BP and WC), and prediction probabilities that can be interpreted as risks. MetS risk maps provide a clear guide to healthcare by representing two boundaries where health conditions change significantly. Based on these boundaries, each individual can clearly perceive whether they are in the safety zone, warning zone, or risk zone for MetS. In addition, the MetS risk map contains a gray zone (Fig 9B) where conversion to MetS begins in earnest, so that subjects in this zone can take or receive more active management measures. Clinicians can recommend and effectively explain appropriate tests and treatments to patients by referring to the predicted risk for each region of the MetS risk map. Finally, because each prediction result is represented as a point in plane coordinates, the MetS risk map can also be used as a visual tool to monitor MetS through periodic measurements, as illustrated by the blue dots in Fig 9B.
The results of this study are limited by the characteristics of the study population, such as race, age, and prevalence, and therefore may not be generalizable to other populations. We used a dataset of middle-aged Koreans in their 40s to 60s, in which the prevalence of MetS was 13.6%. We also excluded subjects undergoing blood pressure or cholesterol-related treatment and those taking blood pressure or cholesterol-related drugs. Moreover, it should be noted that the analysis is based on datasets collected from 2004 to 2013, so there is a time gap of more than 10 years. However, because the model development followed a procedure that maintains representativeness within a given population, we expect that future studies will be able to compare and evaluate its performance in other populations.

Supporting information
S1 Fig. MetS risk map with raw values. The WC axis, representing waist circumference, displays distinct values based on gender (M for males and F for females). The BP axis transforms the systolic (S) and diastolic (D) blood pressure values into equivalent values between 0 and 1 and then selects the larger of the two. The raw values displayed for S, D, M, and F correspond to transformed values between 0 and 1, rounded to one decimal place. As an example, for a female with a waist circumference of 83 cm, systolic blood pressure of 131 mmHg, and diastolic blood pressure of 88 mmHg, the corresponding values for waist circumference, systolic blood pressure, and diastolic blood pressure are 0.4, 0.6, and 0.7, respectively, and the final blood pressure value is 0.7.